Abstract

Tandem mass spectrometry fragments a large number of molecules of the same peptide
sequence into charged prefix and suffix subsequences, and then measures mass/charge ratios of
these ions. The de novo peptide sequencing problem is to reconstruct the peptide sequence from
given tandem mass spectral data of k ions. By implicitly transforming the spectral data into an
NC-spectrum graph G = (V, E) where |V| = 2k + 2, we can solve this problem in O(|V| + |E|) time
and O(|V|) space using dynamic programming. Our approach can be further used to discover a
modified amino acid in O(|V||E|) time and to analyze data with other types of noise in O(|V||E|)
time. Our algorithms have been implemented and tested on actual experimental data.






1 Introduction 



The determination of the amino acid sequence of a protein is the first step toward solving the structure
and the function of this protein. Conventional sequencing methods cleave proteins into peptides
and then sequence the peptides individually using Edman degradation or ladder sequencing by mass
spectrometry or tandem mass spectrometry. Among such methods, tandem mass spectrometry
combined with microcolumn liquid chromatography has been widely used as follows. A large number
of molecules of the same but unknown peptide sequence are selected from a liquid chromatographer
and a mass analyzer. Then they are fragmented and ionized by collision-induced dissociation.
Finally, all the resulting ions are measured by the tandem mass spectrometer for mass/charge ratios.
In the process of collision-induced dissociation, a peptide bond at a random position is broken,
and each molecule is fragmented into two complementary ions, typically an N-terminal b-ion and
a C-terminal y-ion. For example, if the ith peptide bond of a peptide sequence of n amino acids
(NH2CHR_1CO - NHCHR_2CO - ... - NHCHR_nCOOH) is broken, the N-terminal ion corresponds to a charged
prefix subsequence (NH2CHR_1CO - ... - NHCHR_iCO+) and the C-terminal ion corresponds to a charged
suffix subsequence (NH2CHR_{i+1}CO - ... - NHCHR_nCOOH + H+). This process fragments a large number of
molecules of the same peptide sequence, and therefore the resulting ions contain almost all possible
prefix subsequences and suffix subsequences, and display a spectrum in the tandem mass spectrometer.
All these prefix (or suffix) subsequences form a sequence ladder where two adjacent sequences
differ by one amino acid. In the tandem mass spectrum, each ion appears at the position of its mass
because it carries a +1 charge.
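
As an illustration of the sequence ladder described above, the following Python sketch enumerates the singly charged b-ions and y-ions of the peptide DII. The residue masses, the water mass, and the proton mass are standard monoisotopic values supplied for this example, not values taken from the text, and the names `ion_ladder` and `peptide_mass` are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, assumed for this sketch).
RESIDUE_MASS = {"D": 115.02694, "I": 113.08406}
WATER = 18.01056    # H2O added to the residues of a complete peptide
PROTON = 1.00728    # mass of the charge carrier H+


def peptide_mass(peptide):
    """Neutral mass of the whole peptide: residues plus one water."""
    return sum(RESIDUE_MASS[a] for a in peptide) + WATER


def ion_ladder(peptide):
    """Return (b_ions, y_ions) for every peptide bond, each ion carrying a +1 charge."""
    masses = [RESIDUE_MASS[a] for a in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):            # break the i-th peptide bond
        prefix = sum(masses[:i])
        suffix = sum(masses[i:])
        b_ions.append(prefix + PROTON)          # N-terminal b-ion
        y_ions.append(suffix + WATER + PROTON)  # C-terminal y-ion
    return b_ions, y_ions


b, y = ion_ladder("DII")
print("b-ions:", [round(m, 2) for m in b])
print("y-ions:", [round(m, 2) for m in y])
print("peptide mass:", round(peptide_mass("DII"), 2))
```

Under these assumed masses, each complementary pair of ions sums to the peptide mass plus two protons, which is the relation that allows a suffix mass to be converted into the mass of the complementary prefix.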

*A preliminary version appeared in Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, 
pages 389-398, 2000. 
Figure 1: Hypothetical tandem mass spectrum of peptide DII.



Figure 1 shows all the ions of the peptide DII in a hypothetical tandem mass spectrum. The
interpretation of a real tandem mass spectrum has to deal with the following two factors: (1) some 
ions may be lost in the experiments and the corresponding mass peaks disappear in the spectrum; 
(2) it is unknown whether a mass peak corresponds to a prefix or a suffix subsequence. The de novo 
peptide sequencing problem takes an input of a subset of prefix and suffix masses of a target peptide 
sequence P and asks for a peptide sequence Q such that a subset of its prefixes and suffixes gives the 
same input masses. Note that Q may or may not be the same as P, depending on the quality of the
input data.

In practice, other factors can also affect a tandem mass spectrum. An ion may display two or
three different mass peaks because of the distribution of two isotopic carbons, C^12 and C^13, in the
molecules. An ion may lose a water or an ammonia molecule and display a different mass peak from
its normal one. An amino acid at some unknown location of the peptide sequence may be modified,
which changes its mass. This modification appears in every molecule of this peptide, and all the ions
containing the modified amino acid display different mass peaks from the unmodified ions. Finding
the modified amino acid is of great interest to biologists because the modification is usually associated
with protein functions.

Several computer programs have been designed to interpret the tandem mass spectral data. A
popular approach is to correlate peptide sequences in a protein database with the tandem mass
spectrum. Peptide sequences in the database are converted into hypothetical tandem mass spectra, 
which are matched against the target spectrum using some correlation functions, and the sequences 
with top scores are reported. This approach gives an accurate identification, but cannot handle the 
peptides that are not in the database. Also, it does not scale up very well with the length of a protein 
and the size of a protein database because the number of peptides for a protein grows quadratically 
with the length of the protein. Pruning techniques have been applied to screen the peptides before 
matching but at the cost of reduced accuracy. 

An alternative approach is de novo peptide sequencing. The peptide sequences are extracted
from the spectral data before they are validated in the database. First, the spectral data is transformed
to a directed acyclic graph, called a spectrum graph, where (1) a node corresponds to a mass
peak and an edge, labeled by some amino acids, connects two nodes that differ by the total mass of the
amino acids in the label; (2) a mass peak is transformed into several nodes in the graph, and each
node represents a possible prefix subsequence (ion) for the peak. Then, an algorithm is called to find
a longest or highest-scoring path in the graph. The concatenation of edge labels in the path gives
one or multiple candidate peptide sequences. However, the well-known algorithms for finding the
longest path tend to include multiple nodes associated with the same mass peak. This interprets a
mass peak with multiple ions of a peptide sequence, which is rare in practice. This paper provides
efficient sequencing algorithms for a general interpretation of the data by restricting a path to contain
at most one node for each mass peak.

For this purpose, we introduce the notion of an NC-spectrum graph G = (V, E) for a given tandem
mass spectrum, where |V| = 2k + 2 and k is the number of mass peaks in the spectrum. In conjunction
with this graph, we develop a dynamic programming approach to obtain the following results for
previously open problems:

• The de novo peptide sequencing problem can be solved in O(|V| + |E|) time and O(|V|) space
for clean spectral data, and in O(|V||E|) time and O(|V|^2) space for noisy data.

• A modified amino acid can be found in O(|V||E|) time.

Our paper is organized as follows. Section 2 formally defines the NC-spectrum graph and the
peptide sequencing problem. Section 3 describes the dynamic programming algorithms. Section 4
refines the algorithms for the data with a modified amino acid and other types of noise. Section 5
reports the implementation and testing of our algorithms on experimental data. Section 6 mentions
further research.

2 Spectrum graphs and the peptide sequencing problem 

Given the mass W of a target peptide sequence P, k ions I_1, ..., I_k of P, and the masses w_1, ..., w_k
of these ions, we create the NC-spectrum graph G = (V, E) as follows.

For each I_j, it is unknown whether it is an N-terminal ion or a C-terminal ion. If I_j is a C-terminal
ion, it has a complementary N-terminal ion, denoted as I_j^c, with a mass of W - w_j. Therefore, we
create two complementary nodes N_j and C_j to represent I_j and I_j^c, one of which must be an N-terminal
ion. We also create two auxiliary nodes N_0 and C_0 to represent the zero-length and full-length
N-terminal ions of P. Let V = {N_0, N_1, ..., N_k, C_0, C_1, ..., C_k}. Each node x ∈ V is placed on the real
line, and its coordinate cord(x) is the total mass of its amino acids, i.e.,

    cord(x) = 0          if x = N_0;
              W - 18     if x = C_0;
              w_j - 1    if x = N_j for j = 1, ..., k;
              W - w_j    if x = C_j for j = 1, ..., k.

This coordinate scheme is adopted for the following reasons. An N-terminal b-ion has an extra
hydrogen (approximately 1 dalton), so cord(N_j) = w_j - 1 and cord(C_j) = (W - (w_j - 1)) - 1 = W - w_j,
and the full peptide sequence of P has two extra hydrogens and one extra oxygen (approximately 18
daltons), so cord(C_0) = W - 18. If cord(N_i) = cord(C_j) for some i and j, then I_i and I_j are complementary:
one of them corresponds to a prefix sequence and the other corresponds to the complementary suffix
sequence. In the spectrum graph, they are transformed into one pair of complementary nodes. We
say that N_j and C_j are derived from I_j. For convenience, for x and y ∈ V, if cord(x) < cord(y), then
we say x < y.

The edges of G are specified as follows. For x and y ∈ V, there is a directed edge from x to y,
denoted by E(x, y) = 1, if the following conditions are satisfied: (1) x and y are not derived from the
same I_j; (2) x < y; and (3) cord(y) - cord(x) equals the total mass of some amino acids. Figure 2
shows a tandem mass spectrum and its corresponding NC-spectrum graph.

Figure 2: A tandem mass spectrum of peptide DII (359.21 daltons) and its corresponding NC-spectrum graph.

Since G is a directed graph along a line and all edges point to the right on the real line, we list
the nodes from left to right according to their coordinates as x_0, x_1, ..., x_k, y_k, ..., y_1, y_0.
Because cord(N_j) + cord(C_j) = W - 1 for every j ≥ 1, complementary nodes are mirror images of each
other on the line, so x_j and y_j are complementary for each j ≥ 1.

Lemma 1 The peptide sequencing problem is equivalent to the problem which, given G = (V, E),
asks for a directed path from x_0 to y_0 which contains exactly one of x_j and y_j for each j > 0.

Proof. If the peptide sequence is known, we can identify the nodes of G corresponding to the
prefix subsequences of this peptide. These nodes form a directed path from x_0 to y_0. Generally the
mass of a prefix subsequence does not equal the mass of any suffix subsequence, so the path contains
exactly one of x_j and y_j for each j > 0.

A satisfying directed path from x_0 to y_0 contains all observed prefix subsequences. If each edge
on the path corresponds to one amino acid, we can visit the edges on the path from left to right, and
concatenate these amino acids to form a peptide sequence that displays the tandem mass spectrum.
If some edge corresponds to multiple amino acids, we obtain more than one peptide sequence.

Even if the mass of a prefix subsequence coincidentally equals the mass of a suffix subsequence,
which means the directed path contains both x_j and y_j, we can remove either x_j or y_j from the path
and form a new path corresponding to multiple peptide sequences which contain the real sequence.
□

We call such a directed path a feasible reconstruction of P or a feasible solution of G. To construct
G, we use a mass array A, which takes an input of mass m, and returns 1 if m equals the total mass
of some amino acids, and 0 otherwise. Let h be the maximum mass under construction. Let δ be the
measurement precision for mass. Then,

Theorem 2 Assume that we are given the maximum mass h and the mass precision δ.

1. The mass array A can be constructed in O(h/δ) time.

2. With A, G can be constructed in O(k^2) time.

Proof. These statements are proved as follows.

Statement 1. Given a mass m, 0 < m ≤ h, A[m] = 1 if and only if m equals one amino acid
mass, or there exists an amino acid mass r < m such that A[m - r] = 1. If A is computed in the
order from A[0] to A[h/δ], each entry can be determined in constant time since there are only 20 amino
acids. The total time is O(h/δ).

Statement 2. For any two nodes v_i and v_j of G, we create an edge from v_i to v_j, E(v_i, v_j) = 1,
if and only if 0 < cord(v_j) - cord(v_i) ≤ h and A[cord(v_j) - cord(v_i)] = 1. There are O(k^2) pairs of
nodes. With A, G can be constructed in O(k^2) time. □

In current practice, δ = 0.01 dalton, and h = 400 daltons, roughly the total mass of four amino
acids. The efficiency of our algorithm will allow biologists to consider much larger h and much smaller
δ.
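
The construction in Theorem 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: masses are assumed to be already scaled to integer multiples of δ, the three-letter residue alphabet uses nominal masses of G, A and S, the ion masses 58 and 129 are hypothetical b-ions of a toy peptide GAS, and the names `build_mass_array`, `build_nodes` and `build_edges` are invented for the example.

```python
from itertools import combinations


def build_mass_array(residue_masses, h):
    """A[m] = 1 iff integer mass m (0 < m <= h) is a sum of residue masses."""
    A = [0] * (h + 1)
    for m in range(1, h + 1):
        # Constant work per entry: one check per residue mass.
        A[m] = int(any(r == m or (r < m and A[m - r]) for r in residue_masses))
    return A


def build_nodes(W, ion_masses):
    """Coordinates of N_0, C_0 and of N_j, C_j for each observed ion mass w_j."""
    cord = {("N", 0): 0, ("C", 0): W - 18}
    for j, w in enumerate(ion_masses, start=1):
        cord[("N", j)] = w - 1      # I_j read as an N-terminal ion
        cord[("C", j)] = W - w      # I_j read as a C-terminal ion
    return cord


def build_edges(cord, A, h):
    """Directed edges (u, v) satisfying conditions (1) to (3) of the text."""
    edges = set()
    for u, v in combinations(cord, 2):
        if u[1] == v[1] and u[1] != 0:
            continue                # (1) both nodes derived from the same ion
        if cord[u] > cord[v]:
            u, v = v, u             # (2) edges point to the right
        gap = cord[v] - cord[u]
        if 0 < gap <= h and A[gap]:
            edges.add((u, v))       # (3) the gap is a sum of residue masses
    return edges


# Toy data (assumed): nominal residue masses of G, A, S and peptide G-A-S.
RESIDUES = [57, 71, 87]
W = 57 + 71 + 87 + 18
A = build_mass_array(RESIDUES, h=250)
cord = build_nodes(W, ion_masses=[58, 129])
edges = build_edges(cord, A, h=250)
print(sorted(edges, key=lambda e: (cord[e[0]], cord[e[1]])))
```

The pairwise loop examines O(k^2) node pairs with a constant-time lookup in A for each, which matches the bound in Statement 2.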

3 Algorithms for peptide sequencing 
3.1 Dynamic programming 

We list the nodes of G from left to right as x_0, x_1, ..., x_k, y_k, ..., y_1, y_0. Let M(i, j) be a two-dimensional
table with 0 ≤ i, j ≤ k. Let M(i, j) = 1 if and only if in G, there is a path L from x_0 to x_i and a
path R from y_j to y_0, such that L ∪ R contains exactly one of x_p and y_p for every p ∈ [0, i] ∪ [0, j].
Let M(i, j) = 0 otherwise.

Algorithm Compute-M(G)

1. Initialize M(0, 0) = 1 and M(i, j) = 0 for all i ≠ 0 or j ≠ 0;

2. Compute M(1, 0) and M(0, 1);

3. For j = 2 to k

4. For i = 0 to j - 2

(a) if M(i, j - 1) = 1 and E(x_i, x_j) = 1, then M(j, j - 1) = 1;

(b) if M(i, j - 1) = 1 and E(y_j, y_{j-1}) = 1, then M(i, j) = 1;

(c) if M(j - 1, i) = 1 and E(x_{j-1}, x_j) = 1, then M(j, i) = 1;

(d) if M(j - 1, i) = 1 and E(y_j, y_i) = 1, then M(j - 1, j) = 1.

Lemma 3 Given G = (V, E), Algorithm Compute-M computes the table M in O(|V|^2) time.

Proof. Let L and R be the paths that correspond to M(i, j) = 1. If i < j, by definition, after
removing node y_j from R, L ∪ R - {y_j} contains exactly one of x_q and y_q for all 1 ≤ q ≤ j - 1. If
(y_j, y_p) ∈ R, then M(i, p) = 1, and either p = j - 1 or i = j - 1, which corresponds to Step 4(b)
or 4(d) respectively in the algorithm, because either x_{j-1} or y_{j-1}, but not both, is in L ∪ R. A
similar analysis holds for the cases of Step 4(a) or 4(c). The loop at Step 3 uses previously computed
M(0, j - 1), ..., M(j - 1, j - 1) and M(j - 1, 0), ..., M(j - 1, j - 1) to fill up M(0, j), ..., M(j, j) and
M(j, 0), ..., M(j, j). Thus the algorithm computes M correctly. Note that |V| = 2k + 2 and Steps
4(a), 4(b), 4(c), and 4(d) take O(1) time, and thus the total time is O(|V|^2). □
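
A direct Python transcription of Algorithm Compute-M might look as follows. It is a sketch under assumptions: nodes are encoded as the tuples ("x", i) and ("y", i), the graph is given as a set of (tail, head) pairs, Step 2 is read as M(1, 0) = E(x_0, x_1) and M(0, 1) = E(y_1, y_0), and the helper `has_feasible_solution` and the toy graph are hypothetical additions.

```python
def compute_M(k, edges):
    """Fill the table M of Algorithm Compute-M for nodes x_0..x_k, y_k..y_0."""
    def E(u, v):
        return (u, v) in edges

    M = [[0] * (k + 1) for _ in range(k + 1)]
    M[0][0] = 1                                       # Step 1
    if k >= 1:                                        # Step 2 (our reading)
        M[1][0] = int(E(("x", 0), ("x", 1)))
        M[0][1] = int(E(("y", 1), ("y", 0)))
    for j in range(2, k + 1):                         # Step 3
        for i in range(0, j - 1):                     # Step 4: i = 0 .. j-2
            if M[i][j - 1] and E(("x", i), ("x", j)):
                M[j][j - 1] = 1                       # (a)
            if M[i][j - 1] and E(("y", j), ("y", j - 1)):
                M[i][j] = 1                           # (b)
            if M[j - 1][i] and E(("x", j - 1), ("x", j)):
                M[j][i] = 1                           # (c)
            if M[j - 1][i] and E(("y", j), ("y", i)):
                M[j - 1][j] = 1                       # (d)
    return M


def has_feasible_solution(k, edges, M):
    """A solution exists iff a cross edge joins x_i and y_j with M(i, j) = 1
    and max(i, j) = k, so that every pair up to k is covered."""
    if k == 0:
        return (("x", 0), ("y", 0)) in edges
    row = any(M[k][j] and (("x", k), ("y", j)) in edges for j in range(k))
    col = any(M[i][k] and (("x", i), ("y", k)) in edges for i in range(k))
    return row or col


# Toy graph with k = 2 (assumed data): the path x_0 -> x_1 -> y_2 -> y_0 is feasible.
toy_edges = {
    (("x", 0), ("x", 1)), (("x", 1), ("y", 2)), (("y", 2), ("y", 0)),
    (("x", 0), ("y", 2)), (("x", 0), ("y", 0)), (("x", 1), ("y", 0)),
    (("x", 2), ("y", 1)),
}
toy_M = compute_M(2, toy_edges)
print(toy_M, has_feasible_solution(2, toy_edges, toy_M))
```

The two nested loops perform constant work per (i, j) pair, in line with the O(|V|^2) bound of Lemma 3.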

Theorem 4 The following statements hold. 

1. Given G = (V, E) and M, a feasible solution of G can be found in O(|V|) time.

2. Given G = (V, E), a feasible solution of G can be found in O(|V|^2) time and O(|V|^2) space.

3. Given G = (V, E), all feasible solutions of G can be found in O(|V|^2 + n|V|) time and O(|V|^2 +
n|V|) space, where n is the number of solutions.

Proof. These statements are proved as follows. 

Statement 1. Note that |V| = 2k + 2. Without loss of generality, assume that a feasible solution
S contains node x_k. Then there exists some j < k, such that (x_k, y_j) is an edge in S and M(k, j) = 1.
Therefore, we search the non-zero entries in the last row of M and find a j that satisfies both
M(k, j) = 1 and E(x_k, y_j) = 1. This takes O(|V|) time. With M(k, j) = 1, we backtrack M to
search the next edge of S as follows. If j = k - 1, the search starts from i = k - 2 down to 0 until both
E(x_i, x_k) = 1 and M(i, j) = 1 are satisfied; otherwise j < k - 1, and then E(x_{k-1}, x_k) = 1 and
M(k - 1, j) = 1. We repeat this process to find every edge of S. The process visits every node of G
at most once in the order from x_k to x_0 and from y_k to y_0. The total cost is O(|V|) time.

Statement 2. We compute M by means of Lemma 3 and find a feasible solution by means of
Statement 1. The total cost is O(|V|^2) time and O(|V|^2) space.

Statement 3. The proof is similar to that of Statement 1. We can find all the feasible solutions
by backtracking M, and each feasible solution costs O(|V|) time and O(|V|) space. Computing M
and finding n solutions cost O(|V|^2 + n|V|) time and O(|V|^2 + n|V|) space in total. □
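
The backtracking in the proof of Statement 1 can be sketched as below. The sketch assumes the table M has already been computed, encodes nodes as ("x", i) and ("y", i), and handles both the case where the solution contains x_k and the symmetric case where it contains y_k. The function name and the toy data are hypothetical.

```python
def find_feasible_solution(k, edges, M):
    """Return one feasible path as a list of nodes, or None if there is none."""
    def E(u, v):
        return (u, v) in edges

    # The closing cross edge (x_i, y_j) must have i = k or j = k.
    if k == 0:
        cands = [(0, 0)]
    else:
        cands = [(k, j) for j in range(k)] + [(i, k) for i in range(k)]
    start = next(((i, j) for i, j in cands
                  if M[i][j] and E(("x", i), ("y", j))), None)
    if start is None:
        return None

    i, j = start
    xs, ys = [i], [j]              # x-indices from right to left, y-indices downward
    while (i, j) != (0, 0):
        if i > j:
            # x_{i-1} is forced unless y_{i-1} starts R (then search further left).
            cand = [i - 1] if (j < i - 1 or i == 1) else range(i - 2, -1, -1)
            i = next(p for p in cand if M[p][j] and E(("x", p), ("x", i)))
            xs.append(i)
        else:
            cand = [j - 1] if (i < j - 1 or j == 1) else range(j - 2, -1, -1)
            j = next(q for q in cand if M[i][q] and E(("y", j), ("y", q)))
            ys.append(j)
    return [("x", a) for a in reversed(xs)] + [("y", b) for b in ys]


# Toy graph with k = 2 and its table M (assumed data).
toy_edges = {
    (("x", 0), ("x", 1)), (("x", 1), ("y", 2)), (("y", 2), ("y", 0)),
    (("x", 0), ("y", 2)), (("x", 0), ("y", 0)), (("x", 1), ("y", 0)),
    (("x", 2), ("y", 1)),
}
toy_M = [[1, 0, 0], [1, 0, 1], [0, 0, 0]]
print(find_feasible_solution(2, toy_edges, toy_M))
```

Reading the edge labels along the returned path from left to right then yields the candidate peptide sequence or sequences, as in Lemma 1.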

3.2 An improved algorithm 

To improve the time and space complexities in Theorem 4, we encode M into two linear arrays.
Define an edge (x_i, y_j) with 0 ≤ i, j ≤ k to be a cross edge, and an edge (x_i, x_j) or (y_j, y_i) with
0 ≤ i < j ≤ k to be an inside edge. Let lce(z) be the length of the longest consecutive inside edges
starting from node z; i.e.,

    lce(x_i) = j - i  if E(x_i, x_{i+1}) = ... = E(x_{j-1}, x_j) = 1 and (j = k or E(x_j, x_{j+1}) = 0);
    lce(y_i) = i - j  if E(y_i, y_{i-1}) = ... = E(y_{j+1}, y_j) = 1 and (j = 0 or E(y_j, y_{j-1}) = 0).

Let dia(z) be two diagonals in M, where

    dia(x_j) = M(j, j - 1) for 0 < j ≤ k;
    dia(y_j) = M(j - 1, j) for 0 < j ≤ k;
    dia(x_0) = dia(y_0) = 1.

Lemma 5 Given lce(·) and dia(·), any entry of M can be computed in O(1) time.

Proof. Without loss of generality, let M(i, j) be the entry we want to compute, where
0 ≤ i < j ≤ k. If i = j - 1, M(i, j) = dia(y_j) as defined; otherwise i < j - 1 and M(i, j) = 1 if and
only if M(i, i + 1) = 1 and E(y_j, y_{j-1}) = ... = E(y_{i+2}, y_{i+1}) = 1, which is equivalent to dia(y_{i+1}) = 1
and lce(y_j) ≥ j - i - 1. Thus both cases can be solved in O(1) time. □

Lemma 6 Given G = (V, E), lce(·) and dia(·) can be computed in O(|V| + |E|) time.

Proof. We retrieve consecutive edges starting from y_k, y_{k-1}, ..., until the first y_p with p ≤ k
and E(y_p, y_{p-1}) = 0. Then we can fill lce(y_k) = k - p, lce(y_{k-1}) = k - p - 1, ..., and lce(y_p) = 0
immediately. Next, we start a new retrieving and filling process from y_{p-1}, and repeat this until y_0 is
visited. Eventually we retrieve O(k) consecutive edges. A similar process can be applied to x. Using
a common graph data structure such as a linked list, a consecutive edge can be retrieved in constant
time, and thus lce(·) can be computed in O(|V|) time.

By definition, dia(x_j) = M(j, j - 1) = 1 if and only if there exists some i with 0 ≤ i < j - 1, M(i, j - 1) = 1,
and E(x_i, x_j) = 1. If we have computed dia(x_0), ..., dia(x_{j-1}) and dia(y_{j-1}), ..., dia(y_0),
then M(i, j - 1) can be computed in constant time by means of the proof in Lemma 5. To find the
x_i for E(x_i, x_j) = 1, we can visit every inside edge that ends at x_j. Therefore the computation of
dia(·) visits every inside edge exactly once, and the total time is O(|V| + |E|). □
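
The encoding of M by lce(·) and dia(·) can be sketched in Python as follows. This is an illustrative reading of Lemmas 5 and 6, not the authors' code: nodes are the tuples ("x", i) and ("y", i), the adjacency lists `pred_x` and `succ_y` are an assumed representation of the inside edges, and the case j = 1 is treated as the base case M(1, 0) = E(x_0, x_1) and M(0, 1) = E(y_1, y_0).

```python
def compute_lce_dia(k, edges):
    """Return (lce_x, lce_y, dia_x, dia_y, M), where M(i, j) is an O(1) lookup."""
    pred_x = [[] for _ in range(k + 1)]   # pred_x[j]: all i with edge (x_i, x_j)
    succ_y = [[] for _ in range(k + 1)]   # succ_y[j]: all i with edge (y_j, y_i)
    for (s, a), (t, b) in edges:
        if s == "x" and t == "x":
            pred_x[b].append(a)
        elif s == "y" and t == "y":
            succ_y[a].append(b)

    # lce: length of the run of consecutive inside edges starting at each node.
    lce_x = [0] * (k + 1)
    lce_y = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        if (("x", i), ("x", i + 1)) in edges:
            lce_x[i] = lce_x[i + 1] + 1
    for i in range(1, k + 1):
        if (("y", i), ("y", i - 1)) in edges:
            lce_y[i] = lce_y[i - 1] + 1

    dia_x = [0] * (k + 1)
    dia_y = [0] * (k + 1)
    dia_x[0] = dia_y[0] = 1

    def M(i, j):
        """Entry of M recovered from lce and dia (Lemma 5 and its mirror image)."""
        if i == j:
            return int(i == 0)
        if i < j:
            return int(dia_y[i + 1] == 1 and lce_y[j] >= j - i - 1)
        return int(dia_x[j + 1] == 1 and lce_x[j + 1] >= i - j - 1)

    for j in range(1, k + 1):
        # Scan the inside edges ending at x_j (resp. leaving y_j); j = 1 is the base case.
        dia_x[j] = int(any((i < j - 1 or j == 1) and M(i, j - 1) for i in pred_x[j]))
        dia_y[j] = int(any((i < j - 1 or j == 1) and M(j - 1, i) for i in succ_y[j]))
    return lce_x, lce_y, dia_x, dia_y, M


# Toy graph with k = 2 (assumed data).
toy_edges = {
    (("x", 0), ("x", 1)), (("x", 1), ("y", 2)), (("y", 2), ("y", 0)),
    (("x", 0), ("y", 2)), (("x", 0), ("y", 0)), (("x", 1), ("y", 0)),
    (("x", 2), ("y", 1)),
}
lce_x, lce_y, dia_x, dia_y, M = compute_lce_dia(2, toy_edges)
print(lce_x, lce_y, dia_x, dia_y)
```

Each inside edge is inspected once while filling dia, and every lookup of M touches only a constant number of array cells, which is the property Theorem 7 relies on.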

Theorem 7 Assume that G = (V, E) is given.

1. A feasible solution of G can be found in O(|V| + |E|) time and O(|V|) space.

2. All feasible solutions of G can be found in O(n|V| + |E|) time and O(n|V|) space, where n is
the number of solutions.

Proof. These statements are proved as follows.

Statement 1. By Lemma 6, lce(·) and dia(·) can be computed in O(|V| + |E|) time and O(|V|)
space. By Lemma 5, the last row and the last column of M can be reconstructed from lce and dia
in O(|V|) time. By Theorem 4 and Lemma 5, a feasible solution of G can be found in O(|E|) time.
Therefore, finding a feasible solution takes O(|V| + |E|) time and O(|V|) space.

Statement 2. The proof is similar to the proof of Statement 3 in Theorem 4. Finding an additional
feasible solution takes O(|V|) time and O(|V|) space. Thus finding n solutions takes O(n|V| + |E|)
time and O(n|V|) space. □

A feasible solution of G is a path of k + 1 nodes and k edges, and therefore there must exist an
edge between any two nodes on the path by the edge transitive relations. This implies that there are
at least (k + 1)k/2 or O(|V|^2) edges in the graph. However, in practice, a threshold is usually set
for the maximum length (mass) of an edge, so the number of edges in G could be much smaller than
O(|V|^2) and may actually equal O(|V|) sometimes.

4 Algorithms for noisy data



4.1 Amino acid modification 

Amino acid modifications are related to protein functions. For example, some proteins are active 
when phosphorylated but inactive when dephosphorylated. Although there are a few hundred known 
modifications, a peptide rarely has two or more modified amino acids. This section discusses how to 
find the position of a modified amino acid from tandem mass spectral data. We assume that the
modified mass is unknown and is not equal to the total mass of any number of amino acids; otherwise, 
it is information-theoretically impossible to detect an amino acid modification from tandem mass 
spectral data. 

Lemma 8 The amino acid modification problem is equivalent to the problem which, given G = (V, E),
asks for two nodes v_i and v_j, such that E(v_i, v_j) = 0 but adding the edge (v_i, v_j) to G creates a feasible
solution that contains this edge.

Proof. Similar to Lemma 1. □

Let G = (V, E) be an NC-spectrum graph with nodes from left to right as x_0, ..., x_k, y_k, ..., y_0.
Let N(i, j) be a two-dimensional table with 0 ≤ i, j ≤ k, where N(i, j) = 1 if and only if there is a
path from x_i to y_j which contains exactly one of x_p and y_p for every p ∈ [i, k] ∪ [j, k]. Let N(i, j) = 0
otherwise.

Algorithm Compute-N(G)

1. Initialize N(i, j) = 0 for all i and j;

2. Compute N(k, k - 1) and N(k - 1, k);

3. For j = k - 2 to 0

4. For i = k to j + 2

(a) if N(i, j + 1) = 1 and E(x_j, x_i) = 1, then N(j, j + 1) = 1;

(b) if N(i, j + 1) = 1 and E(y_{j+1}, y_j) = 1, then N(i, j) = 1;

(c) if N(j + 1, i) = 1 and E(x_j, x_{j+1}) = 1, then N(j, i) = 1;

(d) if N(j + 1, i) = 1 and E(y_i, y_j) = 1, then N(j + 1, j) = 1.

Lemma 9 Given G = (V, E), Algorithm Compute-N computes the table N in O(|V|^2) time.

Proof. Similar to Lemma 3. □
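
Algorithm Compute-N mirrors Compute-M from the other end of the graph, and a Python sketch of it could read as follows. The sketch makes several assumptions: nodes are the tuples ("x", i) and ("y", i), Step 2 is read as N(k, k - 1) = E(x_k, y_{k-1}) and N(k - 1, k) = E(x_{k-1}, y_k), and Step 4(d) is read as extending the path from y_i to y_j, by symmetry with Step 4(a). The toy graph is invented for the example.

```python
def compute_N(k, edges):
    """Fill N, where N[i][j] = 1 iff a path from x_i to y_j uses exactly one of
    x_p, y_p for every p in [i, k] and [j, k]."""
    def E(u, v):
        return (u, v) in edges

    N = [[0] * (k + 1) for _ in range(k + 1)]              # Step 1
    if k >= 1:                                             # Step 2 (our reading)
        N[k][k - 1] = int(E(("x", k), ("y", k - 1)))
        N[k - 1][k] = int(E(("x", k - 1), ("y", k)))
    for j in range(k - 2, -1, -1):                         # Step 3
        for i in range(k, j + 1, -1):                      # Step 4: i = k .. j+2
            if N[i][j + 1] and E(("x", j), ("x", i)):
                N[j][j + 1] = 1                            # (a) prepend x_j
            if N[i][j + 1] and E(("y", j + 1), ("y", j)):
                N[i][j] = 1                                # (b) append y_j
            if N[j + 1][i] and E(("x", j), ("x", j + 1)):
                N[j][i] = 1                                # (c) prepend x_j
            if N[j + 1][i] and E(("y", i), ("y", j)):
                N[j + 1][j] = 1                            # (d) append y_j
    return N


# Toy graph with k = 2 (assumed data).
toy_edges = {
    (("x", 0), ("x", 1)), (("x", 1), ("y", 2)), (("y", 2), ("y", 0)),
    (("x", 0), ("y", 2)), (("x", 0), ("y", 0)), (("x", 1), ("y", 0)),
    (("x", 2), ("y", 1)),
}
toy_N = compute_N(2, toy_edges)
print(toy_N)
```

In the toy graph, N(0, 2) = 1 corresponds to the path x_0, x_1, y_2 and N(1, 0) = 1 to the path x_1, y_2, y_0.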

Theorem 10 Given G = (V, E) which contains all prefix and suffix nodes, all possible amino acid
modifications can be found in O(|V||E|) time and O(|V|^2) space.

Proof. Let M and N be two tables for G computed from Lemmas 3 and 9. Without loss of
generality, let the modification be between two consecutive prefix nodes x_i and x_j with 0 ≤ i < j ≤ k
and E(x_i, x_j) = 0. All the prefix nodes to the right of x_j have the same mass offset from the normal
locations because the corresponding sequences contain the modified amino acid. By adding a new
edge (x_i, x_j) to G, we create a feasible solution S that contains this edge. If i + 1 < j, then y_{i+1} ∈ S,
and thus M(i, i + 1) = 1 and N(j, i + 1) = 1. There are O(k^2) possible combinations of i and j,
and checking all of them takes O(|V|^2) time. If i + 1 = j, then S must contain an edge (y_q, y_p) with
q > j > i > p, which skips over y_i and y_j. S can be found if E(y_q, y_p) = 1 and M(i, p) = 1 and
N(j, q) = 1. There are at most O(|E|) edges, which can be examined in O(|E|) time. Checking
O(|V|) possible i + 1 = j costs O(|V||E|) time. The total complexity is O(|V||E|) time and O(|V|^2)
space. □

Note that the condition in Theorem 10 does not require that all ions in the spectrum are observed.
If some ions are lost but their complementary ions appear, G still contains all prefix and suffix nodes 
of the target sequence. Furthermore, if G does not contain all prefix and suffix nodes because of 
many missing ions, we can still use this algorithm to find the modification but the result depends on 
the quality of the data and the modified mass. 

4.2 Using scoring functions 

In practice, tandem mass spectral data may contain noise such as mass peaks of other types of ions
from the same peptide, mass peaks of ions from other peptides, and mass peaks of unknown ions. A
common way to deal with these situations is to use a pre-defined edge scoring function s(·). With
s, the score of a path is the sum of the scores of the edges on the path. We re-define the peptide
sequencing problem, which given an NC-spectrum graph G = (V, E), asks for a maximum score path
from x_0 to y_0, such that at most one of x_j and y_j for every 1 ≤ j ≤ k is on the path.

Let Q(i, j) be a two-dimensional table with 0 ≤ i, j ≤ k. Q(i, j) > 0 if and only if in G, there is a
path L from x_0 to x_i and a path R from y_j to y_0, such that at most one of x_p and y_p is in L ∪ R for
every p ∈ [0, i] ∪ [0, j]; Q(i, j) = 0 otherwise. If Q(i, j) > 0, let Q(i, j) be the maximum score among
all L and R pairs.

Algorithm Compute-Q(G)

1. Initialize Q(0, 0) = 1 and Q(i, j) = 0 for all i ≠ 0 or j ≠ 0;

2. For j = 1 to k

3. For i = 0 to j - 1

(a) For every E(y_j, y_p) = 1 and Q(i, p) > 0, Q(i, j) = max{Q(i, j), Q(i, p) + s(y_j, y_p)};

(b) For every E(x_p, x_j) = 1 and Q(p, i) > 0, Q(j, i) = max{Q(j, i), Q(p, i) + s(x_p, x_j)}.

Lemma 11 Given G = (V, E), Algorithm Compute-Q computes the table Q in O(|V||E|) time.

Proof. The correctness proof is similar to that for Lemma 3. For every j, Steps 3(a) and 3(b)
visit every edge of G at most once, so the total time is O(|V||E|). □

Theorem 12 Given G = (V, E), a feasible solution of G can be found in O(|V||E|) time and O(|V|^2)
space.

Proof. Algorithm Compute-Q computes Q in O(|V||E|) time and O(|V|^2) space. For every i and
j, if Q(i, j) > 0 and E(x_i, y_j) = 1, we compute the sum Q(i, j) + s(x_i, y_j). Let Q(p, q) + s(x_p, y_q) be
the maximum value, and we can backtrack Q(p, q) to find all the edges of the feasible solution. The
total cost is O(|V||E|) time and O(|V|^2) space. □
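
Algorithm Compute-Q and the final maximization in the proof of Theorem 12 can be sketched in Python as follows. The sketch assumes that edge scores are strictly positive, so that Q(i, j) > 0 reliably marks reachable states, and that the graph is supplied as a dictionary from edges to scores. The names `compute_Q` and `best_closing_edge` and the toy scores are hypothetical.

```python
def compute_Q(k, score):
    """score maps an edge (u, v) to a positive number; a missing key means no edge."""
    pred_x = [[] for _ in range(k + 1)]   # pred_x[j]: all p with edge (x_p, x_j)
    succ_y = [[] for _ in range(k + 1)]   # succ_y[j]: all p with edge (y_j, y_p)
    for (s, a), (t, b) in score:
        if s == "x" and t == "x":
            pred_x[b].append(a)
        elif s == "y" and t == "y":
            succ_y[a].append(b)

    Q = [[0] * (k + 1) for _ in range(k + 1)]
    Q[0][0] = 1                                            # Step 1
    for j in range(1, k + 1):                              # Step 2
        for i in range(j):                                 # Step 3: i = 0 .. j-1
            for p in succ_y[j]:                            # (a) extend R by y_j
                if Q[i][p] > 0:
                    Q[i][j] = max(Q[i][j], Q[i][p] + score[(("y", j), ("y", p))])
            for p in pred_x[j]:                            # (b) extend L by x_j
                if Q[p][i] > 0:
                    Q[j][i] = max(Q[j][i], Q[p][i] + score[(("x", p), ("x", j))])
    return Q


def best_closing_edge(k, score, Q):
    """Return (total, i, j) for the best cross edge (x_i, y_j), or None."""
    best = None
    for i in range(k + 1):
        for j in range(k + 1):
            e = (("x", i), ("y", j))
            if Q[i][j] > 0 and e in score:
                total = Q[i][j] - 1 + score[e]   # drop the initial offset Q(0, 0) = 1
                if best is None or total > best[0]:
                    best = (total, i, j)
    return best


# Toy scores (assumed): edges on the path x_0, x_1, y_2, y_0 score 2, all others 1.
toy_score = {
    (("x", 0), ("x", 1)): 2, (("x", 1), ("y", 2)): 2, (("y", 2), ("y", 0)): 2,
    (("x", 0), ("y", 2)): 1, (("x", 0), ("y", 0)): 1, (("x", 1), ("y", 0)): 1,
    (("x", 2), ("y", 1)): 1,
}
toy_Q = compute_Q(2, toy_score)
print(toy_Q, best_closing_edge(2, toy_score, toy_Q))
```

Because the constraint is "at most one" rather than "exactly one", a peak may be skipped entirely, which is how noise peaks are tolerated in this formulation.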

5 Experimental results 

We have presented algorithms for reconstructing peptide sequences from tandem mass spectral data
with loss of ions. This section reports experimental studies which focus on cases of b-ions losing a
water or ammonia molecule and cases of isotopic varieties for an ion. We treat rare occurrences
such as y-ions losing a water or ammonia molecule, b-ions losing two water or ammonia molecules,
and other types of ions, as noise and apply Algorithm Compute-Q to reconstruct peptide sequences.

Figure 3: Raw tandem mass spectrum and predicted ions of the Chicken Ovalbumin peptide
GGLEPINFQTAADQAR.

Isotopic ions come from the isotopic carbons C^12 and C^13. An ion usually has a couple of isotopic
forms, and the mass difference between two isotopic ions is generally one or two daltons. Their
intensities reflect the binomial distribution between C^12 and C^13. This distribution can be used for
identification. Isotopic ions can be merged to one ion of either the highest intensity or a new mass.

It is very common for a b-ion to lose a water or ammonia molecule. In the construction of an 
NC-spectrum graph, we add three types of edges whose lengths equal the masses of a water molecule, 
amino acids minus one water, and amino acids plus one water respectively. In Algorithm Compute-Q, 
we restrict the net number of waters at each entry to be at most one, since a feasible solution should 
have a net of zero water. We have implemented this algorithm and tested it on the data generated 
by the following process: 

The Chicken Ovalbumin proteins were digested with trypsin in 100 mM ammonium bicarbonate
buffer pH 8 for 18 hours at 37°C. Then 100 µl are injected in acetonitrile into
a reverse phase HPLC interfaced with a Finnigan LCQ ESI-MS/MS mass spectrometer. 
A 1% to 50% acetonitrile 0.1%TFA linear gradient was executed over 60 minutes. 

Figure 3 shows one of our prediction results. The ions labeled in the spectrum were identified
successfully. We use resolution 1.0 dalton and relative intensity threshold 5.0 in our program. More 
experimental results will be shown in the full version of this paper. 



6 Further research 

There are many open problems. Perhaps the most interesting direction would be to consider the case 
of multiple peptides. 